Bacterial genome-wide association study substantiates papGII of Escherichia coli as a major risk factor for urosepsis

Background
Urinary tract infections (UTIs) are among the most common bacterial infections worldwide, often caused by uropathogenic Escherichia coli. Multiple bacterial virulence factors or patient characteristics have been linked separately to progressive, more invasive infections. In this study, we aim to identify pathogen- and patient-specific factors that drive the progression to urosepsis by jointly analysing bacterial and host characteristics.

Methods
We analysed 1076 E. coli strains isolated from 825 clinical cases with UTI and/or bacteraemia by whole-genome sequencing (Illumina). Sequence types (STs) were determined via srst2 and capsule loci via fastKaptive. We compared the isolates from urine and blood to confirm clonality. Furthermore, we performed a bacterial genome-wide association study (bGWAS) (pyseer) using bacteraemia as the primary clinical outcome. Clinical data were collected by an electronic patient chart review. We concurrently analysed the association of the most significant bGWAS hit and important patient characteristics with the clinical endpoint bacteraemia using a generalised linear model (GLM). Finally, we designed qPCR primers and probes to detect papGII-positive E. coli strains and prospectively screened E. coli from urine samples (n = 1657) at two healthcare centres.

Results
Our patient cohort had a median age of 75.3 years (range: 18.00–103.1) and was predominantly female (574/825, 69.6%). The bacterial phylogroups B2 (60.6%; 500/825) and D (16.6%; 137/825), which are associated with extraintestinal infections, represent the majority of the strains in our collection, many of which encode a polysaccharide capsule (63.4%; 525/825). The most frequently observed STs were ST131 (12.7%; 105/825), ST69 (11.0%; 91/825), and ST73 (10.2%; 84/825). Of interest, in 12.3% (13/106) of cases, the E. coli pairs in urine and blood were only distantly related. In line with previous bGWAS studies, we identified the gene papGII (p-value < 0.001), which encodes the adhesin subunit of the E. coli P-pilus, to be associated with ‘bacteraemia’ in our bGWAS. In our GLM, correcting for patient characteristics, papGII remained highly significant (odds ratio = 5.27, 95% confidence interval = [3.48, 7.97], p-value < 0.001). An independent cohort of cases which we screened for papGII-carrying E. coli at two healthcare centres further confirmed the increased relative frequency of papGII-positive strains causing invasive infection, compared to papGII-negative strains (p-value = 0.033, chi-squared test).

Conclusions
This study builds on previous work linking papGII with invasive infection by showing that it is a major risk factor for progression from UTI to bacteraemia and that it has diagnostic potential.

Supplementary Information
The online version contains supplementary material available at 10.1186/s13073-023-01243-x.
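The odds ratio and confidence interval reported for papGII can be related to the underlying logistic-regression coefficient with a short calculation. The sketch below assumes the reported interval is a Wald interval (symmetric on the log-odds scale); the source does not state how the interval was computed, and the function names are illustrative only.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.96):
    """Turn a log-odds coefficient and its standard error into
    an odds ratio with a Wald confidence interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def coefficient_from_ci(lower, upper, z=1.96):
    """Recover the log-odds coefficient and standard error from a reported
    interval, ASSUMING the interval is symmetric on the log scale."""
    beta = (math.log(lower) + math.log(upper)) / 2
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se


# Interval reported in the abstract for papGII: [3.48, 7.97]
beta, se = coefficient_from_ci(3.48, 7.97)
odds_ratio, ci_low, ci_high = odds_ratio_with_ci(beta, se)
print(f"beta = {beta:.3f}, SE = {se:.3f}, OR = {odds_ratio:.2f}")
```

Under this assumption the midpoint of the interval on the log scale reproduces the reported odds ratio of 5.27 to two decimal places, which is consistent with a Wald-type interval but does not prove it.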


a: Significance level and average effect size of genes with mapping unitigs identified as significant in a bGWAS including all clinical cases (n = 751 complete observations) (left) and including cases for which the port of entry for bacteraemia could be assigned to the urinary tract (n = 612 complete observations) (right). In the right figure, only genes with a maximum -log10(p-value) > 11 are labelled. Genes with locus tags 100888-20_01189, 100033-19_04615 and 100033-19_04621 are labelled as papJ_2, papJ_3 and papI_2, respectively, as they were identified as such. The gene with the locus tag 100033-19_03452 is labelled as 'hp' (= hypothetical protein). Odds ratio estimates with 95% confidence intervals, obtained using the generalised linear model (GLM), for b: typical urinary tract infection symptoms (n = 717 complete observations with 213 events); c: admission to the intensive care unit (n = 751 complete observations with 172 events); d: 30-day all-cause mortality (n = 749 complete observations with 45 events). e: Performance of GLM classifiers using 'Invasive disease' as outcome variable and the same dataset and variables as in the GLM as input (751 complete observations with 210 events), either including the presence of papGII as a predictor or not. Error bars indicate the standard deviation, and the means were compared using paired Wilcoxon tests. OR = odds ratio; CI = confidence interval; CCI = Charlson Comorbidity Index; AUROC = area under the receiver operating characteristic curve; NPV = negative predictive value; PPV = positive predictive value; 'ns' = not significant; '*' = p-value < 0.05; '**' = p-value < 0.01; '***' = p-value < 0.001.
Table 1: Primers which were used in the endpoint PCR assay at centre 1. Primers with an asterisk (*) were newly designed for this study.

Columns: Name; Target; Amplified [bp]; Temp. [°C]; Sequence (5'-3')

The resulting PCR products were separated by electrophoresis on a 2% agarose gel in 1X TAE buffer at 40 V for 50 minutes. We used the 'Safe Red' dye to visualise the products and recorded the gel on a 'Gel Visualizer Fusion X'.

Centre 2
At centre 2, two sets of primers and probes for the E. coli core gene rpoD were newly designed using Primer-BLAST (4) (Table 5). Both sets were tested for their functionality using endpoint PCR. We used 1 μl of E. coli frozen stocks (E. coli strains isolated in routine diagnostics and preserved in 1 ml of skim milk medium) as template.

Quantitative PCR
Centre 1
In the next step, we evaluated the performance of our primers in a quantitative PCR (qPCR) assay. Table 6 summarises the composition of the qPCR reactions and Table 7 summarises the thermocycler program used. We tested the specificity of the primers gapC_2, uidA, papC_1 and papGII on 32 strains (Table 8).

Figure S4: Core genome phylogeny of 825 E. coli strains. Columns represent (from left to right): the assigned phylogroup, the sequence type, and phenotypic resistance against ceftriaxone, meropenem, fosfomycin, nitrofurantoin and ciprofloxacin.

Figure S6: Average Nucleotide Identity values for unitigs identified in our bGWAS and mapping to papG (X-axis) and the reference sequences for the five papG variants (Y-axis).

Figure S8: C-reactive protein concentration (a) and leucocyte count (b) measured in blood samples of cases for which a papGII-positive or a papGII-negative E. coli strain was isolated from a urine or blood culture sample. Leucocyte counts were measured on the day the urine or blood culture samples were taken.

Figure S12: Occurrence of MALDI-TOF mass peaks in spectra acquired from E. coli strains of different phylogroups. 'Occurrence' refers to the percentage of spectra per group in which a peak was detected. Each strain was measured in quadruplicate on either a Microflex Biotyper device or an Axima Confidence device. Phylogroups for which fewer than five strains were available (E1, E2 and G) were excluded from the plot. Masses are only depicted if detected in > 50% or < 25% of spectra for one or more of the groups.
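The occurrence statistic and the display filter described in this caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes peaks have already been binned to shared integer masses, and the group names and mass values are hypothetical.

```python
def peak_occurrence(spectra_by_group):
    """Percentage of spectra per group in which each mass was detected.

    spectra_by_group maps a group name to a list of spectra, each spectrum
    being a set of (already binned) peak masses.
    """
    all_masses = set()
    for spectra in spectra_by_group.values():
        for peaks in spectra:
            all_masses.update(peaks)

    table = {}
    for mass in sorted(all_masses):
        table[mass] = {
            group: 100.0 * sum(mass in peaks for peaks in spectra) / len(spectra)
            for group, spectra in spectra_by_group.items()
            if spectra  # skip groups without any spectra
        }
    return table


def masses_to_plot(table, high=50.0, low=25.0):
    """Keep a mass if its occurrence is > high or < low in at least one group."""
    return [
        mass
        for mass, by_group in table.items()
        if any(occ > high or occ < low for occ in by_group.values())
    ]


# Hypothetical binned spectra for two groups (four spectra each)
spectra = {
    "A": [{7000, 9060, 9740}, {7000, 9060, 9740}, {9060}, {9060, 9740}],
    "B": [{7000}, set(), set(), set()],
}
occurrence = peak_occurrence(spectra)
selected = masses_to_plot(occurrence)
print(selected)
```

In this toy input, mass 7000 occurs in exactly 50% and 25% of spectra in the two groups, so it satisfies neither strict threshold and is dropped, while the other two masses are kept.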

Figure S13: Core genome phylogeny of the E. coli strains collected for this study (one strain per clinical case, n = 825). Columns show the phylogroup assignment, the sequence type (ST) (the eight most frequent STs are coloured, rarer STs are shown in grey), the papG variant, and the mass of HdeA predicted from the amino acid sequence. Scale bar: 0.01 substitutions per site.

Figure S16: Variants of primer and probe sequences detected in our genome collection (n = 1,076). Sequences used in the qPCR assay are indicated in blue, and alternative variants detected in the genomes are depicted in black. Variants were called using the variant caller Freebayes via snippy with a minimum coverage of 20x.

Figure S17: (a) Efficiency of the primer pairs in the single reaction (blue) and in a triplex reaction (orange) for the primers used at centre 1 (gapC, papC and papGII). (b) Amplification curves of primers used at centre 2: rpoD and papGII in duplex reactions and papGII in a triplex reaction with rpoD and papC.
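Primer efficiency in qPCR is commonly estimated from a dilution series by regressing the quantification cycle (Cq) on log10 template amount and applying E = 10^(-1/slope) - 1. The source does not state how the efficiencies in panel (a) were derived, so the sketch below only illustrates this standard-curve approach; the dilution series and Cq values are hypothetical.

```python
def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def amplification_efficiency(log10_template, cq):
    """Efficiency from the standard-curve slope: E = 10^(-1/slope) - 1.
    E = 1.0 corresponds to perfect doubling per cycle."""
    slope = fit_slope(log10_template, cq)
    return 10 ** (-1 / slope) - 1


# Hypothetical ten-fold dilution series (log10 relative template amount)
log10_dilution = [0, -1, -2, -3, -4]
cq_values = [15.0, 18.4, 21.7, 25.1, 28.4]

efficiency = amplification_efficiency(log10_dilution, cq_values)
print(f"Estimated efficiency: {efficiency:.1%}")
```

Running the same calculation separately on single-reaction and multiplex dilution series would give the kind of comparison shown in panel (a).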

Within-host genetic diversity of E. coli strains isolated from the same clinical cases.
a: Core genome phylogeny of E. coli strains (n = 225) isolated from the same clinical case (n = 106), coloured by phylogroup. The numbers correspond to the case identifier, and strains were labelled only if they exhibited < 99.9% Average Nucleotide Identity to the strain isolated from the same clinical case. b: papG variant encoded by isolates which exhibited < 99.9% Average Nucleotide Identity to the strain isolated from the same clinical case. c: Average Nucleotide Identity of strains isolated from case 3. d: Single nucleotide variants (SNVs) of 10 picked isolates from three cases, either from urine or blood culture samples.

Table 2 summarises the composition of the endpoint PCR reactions and Table 3 the thermocycler program used. The concentration of the genomic DNA was determined using Qubit (Invitrogen, Waltham, USA) and ranged from 21 ng/μl to 114 ng/μl.

Table 2: Composition of the endpoint PCR reactions.

Table 3: Thermocycler program used for the endpoint PCR reaction.

Following the experimental design in Table 4, we assessed the functionality of the primers and tested for cross-reactions between the primers designed for papGII and papGIII, as these are the two papG variants which are genetically most similar (3).

Table 4: Endpoint PCR experimental design to test for the functionality of all primers and the cross-reactivity between the primers for papGII and papGIII.

Table 5: Primers which were used in the endpoint PCR assay at centre 2. Primers with an asterisk (*) were newly designed for this study.

Table 6: Composition of the qPCR assay performed at centre 1.

Table 7: Thermocycler program used for the qPCR reaction performed at centre 1.